Phoebus: A System for Extracting and Integrating Data from Unstructured and Ungrammatical Sources

Authors

  • Matthew Michelson
  • Craig A. Knoblock
Abstract

With the proliferation of online classifieds and auctions comes a new need to meaningfully search and organize the items for sale. However, since the seller’s item descriptions are not structured and do not conform to a standard set of values (think “Chevy” versus “Chevrolet”), searching and organizing this data is difficult. This paper describes a working demonstration of the Phoebus system which uses both record linkage and information extraction to parse out the meaningful attributes of an item description and assign them standard values. This allows the data to be sorted, searched and linked to other data sources where standard values for the attributes are required to link the sources together.
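
The idea of combining record linkage with extraction can be illustrated with a minimal sketch. This is not the Phoebus implementation: the abstract does not describe its algorithms, so the example below substitutes a simple string-similarity heuristic from Python's standard library. The reference set, the function names (`link_record`, `extract`), and the `threshold` value are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference set of standard attribute values (an assumption;
# a real system would load this from a curated data source).
REFERENCE_SET = [
    {"make": "Chevrolet", "model": "Impala"},
    {"make": "Chevrolet", "model": "Malibu"},
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
]


def token_similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_token_score(value, tokens):
    """Score of the post token that best matches a reference value."""
    return max((token_similarity(value, t) for t in tokens), default=0.0)


def link_record(post, reference_set):
    """Record linkage step: pick the reference record closest to the post."""
    tokens = post.split()

    def score(record):
        return sum(best_token_score(v, tokens) for v in record.values())

    return max(reference_set, key=score)


def extract(post, reference_set, threshold=0.5):
    """Extraction step: align post tokens to the linked record's attributes
    and attach the standard value to each extracted token."""
    tokens = post.split()
    record = link_record(post, reference_set)
    result = {}
    for attr, standard in record.items():
        token = max(tokens, key=lambda t: token_similarity(standard, t))
        if token_similarity(standard, token) >= threshold:
            result[attr] = {"text": token, "standard": standard}
    return result


if __name__ == "__main__":
    print(extract("chevy impala 02 runs great $3000 obo", REFERENCE_SET))
```

In this sketch the token "chevy" is linked to the standard value "Chevrolet", which is what lets a listing be sorted, searched, or joined with other sources on a common attribute value. The threshold only guards against attaching an attribute to an unrelated token and would need tuning on real data.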


Similar resources

A Reference-set Approach to Information Extraction from Unstructured, Ungrammatical Data Sources

This thesis investigates information extraction from unstructured, ungrammatical text on the Web such as classified ads, auction listings, and forum postings. Since the data is unstructured and ungrammatical, this information extraction precludes the use of rule-based methods that rely on consistent structures within the text or natural language processing techniques that rely on grammar. Inste...

An Automatic Approach to Semantic Annotation of Unstructured, Ungrammatical Sources: A First Look

There exist numerous sources of data on the World Wide Web that contain useful information but are not structured or grammatical enough to support traditional information extraction. Furthermore, even if the information extraction could be done, the extracted values would need to be standardized to ensure the queries over the source are accurate. This paper presents an automatic, scalable appro...

Application of Big Data Analytics in Power Distribution Network

Smart grid enhances optimization in generation, distribution and consumption of the electricity by integrating information and communication technologies into the grid. Today, utilities are moving towards smart grid applications, most common one being deployment of smart meters in advanced metering infrastructure, and the first technical challenge they face is the huge volume of data generated ...

Impact of Feed Sources and Feeding System on Milk Production and Marketing in the Babille District of East Hararghe Zone, Ethiopia

The aim of this article was to investigate the impact of feed sources and feeding system on milk production and milk marketing in the Babille district of Eastern Hararghe zone. Data were collected using a structured questionnaire which was administered to 152 randomly selected sample dairy cow keepers in the district. Data was analyzed using descriptive methods and regression analysis. Data fro...

Integrating the Population Perspective into Health System Performance Assessment (IPHA): Study Protocol for a Cross-Sectional Study in Germany Linking Survey and Claims Data of Statutorily and Privately Insured

Background: Health system performance assessment (HSPA) is a major tool for evidence-based governance in health systems and patient/population-orientation is increasingly considered as an important aspect. The IPHA study aims (1) to undertake a comprehensive performance assessment of the German health system from a population perspec...



Publication date: 2006